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Abstract 

This paper is devoted to the theory and application of a novel class of models 
for binary data, which we call log-mean linear (LML) models. The characteriz- 
ing feature of these models is that they are specified by linear constraints on the 
LML parameter, defined as a log- linear expansion of the mean parameter of the 
multivariate Bernoulli distribution. We show that marginal independence rela- 
tionships between variables can be specified by setting certain LML interactions 
to zero and, more specifically, that graphical models of marginal independence 
are LML models. LML models are code dependent, in the sense that they are 
not invariant with respect to relabelling of variable values. As a consequence, 
they allow us to specify sub-models defined by code-specific independencies, 
which are independencies in sub-populations of interest. This special feature 
of LML models has useful applications. Firstly, it provides a flexible way to 
specify parsimonious sub-models of marginal independence models. The main 
advantage of this approach concerns the interpretation of the sub-model, which 
is fully characterized by independence relationships, either marginal or code- 
specific. Secondly, the code-specific nature of these models can be exploited to 
focus on a fixed, arbitrary, cell of the probability table and on the correspond- 
ing sub-population. This leads to an innovative family of models, which we call 
pivotal code-specific LML models, that is especially useful when the interest of 
researchers is focused on a small sub-population obtained by stratifying individ- 
uals according to some features. The application of LML models is illustrated 
on two datasets, one of which concerns the use of pivotal code-specific LML 
models in the field of personalized medicine. 
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1 Introduction 



A straightforward way to parameterize the probabihty distribution of a set of categor- 
ical variables is by means of their probability table. Probabilities are easy to interpret 
but have the drawback that sub-models of interest typically involve non-linear con- 
straints on these parameters. For instance, conditional independence relationships 



can be specified by requiring certain factorizations of the cell probabilities; see Lau- 



ritzen (1996) and Cox and Wermuth (1996). For this reason, it is useful to develop 



alternative parameterizations such that sub-models of interest correspond to linear 
subspaces of the parameter space of the saturated model. Log-linear parameters are 
computed as a log-linear expansion of the cell probabilities (see Agresti, 2002) and 
can be seen as a special case of the wider family of marginal log-linear parameters of 



Bergsma and Rudas ( 2002 ) . These parameters are defined by first specifying a suit- 



able sequence of marginal distributions and then by selecting a subset of the log-linear 
parameters computed from each margin of this sequence. The traditional log-linear 
parameters are obtained when the specified sequence only includes the "marginal" 
distribution of all the variables. Otherwise, when all possible marginal distributions 
are included in this sequence, one obtains the multivariate logistic parameters first 
introduced by Glonek and McCuUagh (1995). 

It is well-established that pairwise conditional independence relationships corre- 
spond to certain zero entries in the vector of log-linear parameters and, more generally, 
undirected graphical models for discrete data are sub-families of log-linear models; see 



Lauritzen (1996, Chap. 4). Several recent papers investigate the use of marginal-log 



linear models to specify other classes of graphical models as well as models defined 



by arbitrary collections of conditional independencies; see for instance Evans and 



Richardson (2011), Forcina et al. (2010) and Rudas et al. (2010). 



A special case of interest is the family of graphical models of marginal indepen- 
dence introduced by Cox and Wermuth (1993, 1996) with the name of covariance 
graph models, but later addressed in the literature also as bidirected graph models 
following Richardson (2003). These models have appeared in several applied con- 
texts as described in Drton and Richardson (2008) and references therein. Defining a 
parameterization for discrete marginal independence models represents a challenging 
research area both because pairwise independencies do not imply higher order inde- 
pendencies as, for instance, in the Gaussian case, and because a parameterization for 
these models should also allow the flexible implementation of additional substantive 



constraints to facilitate the selection of parsimonious models; see Evans and Richard- 



son 



(2011). In the case of binary data, Drton and Richardson (2008) showed that 
bidirected graph models can be parameterized by imposing multiplicative constraints 
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on the mean parameter, which they called the Mohius parameter, of the multivariate 
Bernoulli distribution. Successively, Lupparelli et al. (2009) showed that bidirected 
graph models for discrete data are a subclass of multivariate logistic models. 

In this paper we consider binary data and introduce a novel parameterization 
based on a log-linear expansion of the mean parameter of the multivariate Bernoulli 
distribution that we call the log-mean linear (LML) parameterization. Similarly to 
the multivariate logistic parameterization, the computation of LML parameters is 
based on all marginal distributions, but we remark that they are not marginal log- 
linear parameters, with the advantage that their computation from cell probabilities 
is more efficient. The LML parameterization defines a parameter space where the 
multiplicative constraints in the Mobius parameters of Drton and Richardson (2008) 
correspond to linear subspaces and, from this perspective, they resemble the con- 
nection between log-linear parameters and cell probabilities. It is therefore natural 
to investigate the family of LML models obtained by imposing linear constraints on 
the parameter space of the saturated model. We show that marginal independence 
between variables can be specified by setting certain LML parameters to zero and, 
more specifically, that graphical models of marginal independence are LML mod- 
els. Our approach to the problem seems especially appealing because it avoids some 
disadvantages of both the Mobius parameterization and the multivariate logistic one. 

The LML models are code dependent in the sense that they are not invariant with 
respect to relabelling of the variables. We show that this makes it possible to spec- 
ify sub-models defined by code-specific independencies, which are independencies in 
sub-populations of interest. This specific feature of LML models has useful applica- 
tions. Firstly, it provides an alternative way to specify parsimonious bidirected graph 
sub-models with the advantage that all the constrains can be easily interpreted as 
independencies. More specifically, the model is defined by the collection of marginal 
independencies encoded by a bidirected graph under a given Markov property to- 
gether with an additional collection of code-specific independencies. A simulation 
study suggests that the coding of variables can be specified in such a way that statis- 
tical procedures for testing code-specific independencies have appealing asymptotic 
properties. Secondly, the code-specific nature of these models can be exploited to 
focus on a specific cell of the probability table and the corresponding sub-population. 
This leads to an innovative family of models, which we call pivotal code-specific LML 
models, that is especially useful when the interest of researchers is focused on a small 
sub-population specified by stratifying individuals according to some features. The 
application of LML models is illustrated on two datasets, one of which concerns the 
use of pivotal code-specific LML models in the field of personalized medicine. 
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Although this work is devoted to binary variables, we also show that our results 
provide the building blocks for the extension of LML models to categorical variables 
with an arbitrary number of levels. 

The remainder of the paper is organized as follows. In Section |2] we review the 
most common parameterizations of the multivariate Bernoulli distribution and their 
use in graphical modelling with special attention to marginal independence models. In 
Section |3] we introduce the LML parameterization and the associated class of models. 
Section |4] shows that LML models can be used to define marginal independence models 
such as bidirected graph models. Section [5] is devoted to the theory of LML models 
defined with code-specific independencies. Section [6] gives two different applications 
of LML models. Section [7] deals with the non-binary case and, finally. Section |8] 
contains a discussion. Basic lemmas and long proofs are deferred to Appendices |X] 
and|B} respectively, whereas Appendix [C] gives an algorithm for maximum likelihood 
estimation in LML models. 



2 Preliminaries 

In this section we introduce the multivariate Bernoulli distribution and describe its 
most commonly used parameterizations. Furthermore, we review the theory related 
to graphical models of marginal independence at the level requested for this paper; 



we refer to Richardson (2003) and Drton and Richardson (2008) for further details. 



2.1 Parameterizations for binary data 

Given the finite set V = {!,... ,p}, with \V\ = p, let Xy = {X^)v^v be a random 
vector of binary variables taking values in the set Xy = {0, 1}^. We call Xy a 2P-table 
and its elements iy G Xy the cells of the table. In this way, Xy follows a multivariate 
Bernoulli distribution with probability table n{iy), iy G Xy, which we assume to be 
strictly positive. From the fact that Xy = {0,1}*' = {{lD,Oy\£))\D C V}, it follows 
that we can write the probability table as a vector = (vr£))£)cv with entries vr]^ = 
P{X£) = l£),Xy\£) = Oy\D). We refer to vr^ as to the probability parameter of Xy 
and recall that it belongs to the (2^ — l)-dimensional simplex, which we write as 
TT^ G n^. For every U ^ V, with [/ 7^ 0, Xjj = {X^)y^u denotes the marginal 
Bernoulli random vector taking values in the set lu = {0, 1}'^', and the quantities 
ir^ and 11*^ are defined accordingly. Hereafter, whenever U = V we simplify the 
notation and omit the superscript, so that vr = vr^ and 11 = 11^. 

We call 6 = 6^ a parameter of Xy if it is a vector in R^'' that characterizes the 
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joint probability distribution of Xy, and use the convention that the entries of 6 
(called interactions) are indexed by the subsets of V, i.e., 9 = {9d)dcv- H uj = is 
an alternative parameter of Xy, then a result known as Mohius inversion states that 

ujd=Y.^e^ ^DCV ^ 0D=Y1 C V; (1) 

BCD BCD 



see, among others, Lauritzen (1996, Appendix A). Let Z = Z and M = M be two 



(2^ X 2^) matrices with entries indexed by the subsets of V x V and given by 

Zd,h = 1{DCH) and Md,h = {-l)^"^^^l{D C H), 

respectively, where l(-) denotes the indicator function. Then, the equivalence ([T| can 
be written in matrix form as 

cu = Z^e iff ^ = M^a;, (2) 

and Mobius inversion follows by noticing that M = Z~^. In the literature M is usually 
called the Mobius matrix whereas Z is the zeta-matrix. 

In the remaining part of this subsection, we review some well-known alternative 
parameterizations for the distribution of Xy, each defined by a smooth invertible 
mapping from 11 onto a smooth (2^ — l)-dimensional manifold of B?^ . For simplicity, 
we denote both the mapping and the alternative parameter it defines by the same 
(greek) letter. We remark that, if the inverse mapping can be analytically computed, 
then the likelihood function under multinomial sampling can be written in closed 
form as a function of the alternative parameter; Poisson sampling can be dealt with 
by extending the mapping domain to R^^ . 

Multivariate Bernoulli distributions form a regular exponential family with canon- 
ical log-linear parameter A = computed as 

A = M^log7r. (3) 

The mapping A defined by (|3]) can be easily inverted by exploiting Mobius inversion 
to obtain vr = expZ'''A, for all A G A(n). Notice that A(n) has a simple structure, 
since {\d)d^% is free to vary in R^^~^ and A0 is a smooth function of its elements. 
The parameterization A captures conditional features of the distribution of Xy and is 
used to define the class of log-linear models, which is widely used for the analysis of 



discrete data (see Agresti, 2002) and includes, as a special case, the class of undirected 



graphical models (see Lauritzen 1996 Chap. 4). 

The mean parameter of the multivariate Bernoulli distribution is /x = (/iD)_Dcy, 
where /X0 = 1 (on grounds of convention) and /id = P{Xri = Id) otherwise. This was 
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called the Mobius parameter by 



Drton and Richardson 



(2008), because one finds 



fi = Ztt. 



(4) 



The linear mapping fx defined by ^ is trivially Mobius-inverted to obtain vr = M/i, for 
all fi G Ai(n). However, it should be said that the structure of /i(n) is rather involved, 
and actually well-understood only for small p. The parameterization fi satisfies the 
upward compatibility property, i.e., it is invariant with respect to marginalization. 



Drton and Richardson ( 2008 ) used the mean parameterization in the context of graph- 



ical models of marginal independence. 



Bergsma and Rudas (2002) developed a wide class of mixed parameterizations. 



that is parameterizations capturing both marginal and conditional distributional fea- 
tures, named marginal log-linear parameterizations. Broadly speaking, any marginal 
log-linear parameter is obtained by stacking subvectors of log-linear parameters com- 
puted in suitable marginal distributions. Thus, this class of parameterizations in- 
cludes as special, extreme, cases the log-linear parameterization A, where a single 



margin is used, and the multivariate logistic parameterization of Glonek and Mc- 
CuUagh (1995), rj = (r7D)_Dcy, where each rju is computed in the margin X^. The 



parameterization rj clearly satisfies the upward compatibility property, and it is typ- 



ically used for modelling marginal distributions; see McCuUagh and Nelder (1989 



Chap. 6). More generally, marginal log-linear parameterizations can be useful in sev- 



eral contexts (see Bergsma et al. , 2009 ) and, in particular, they have been recently 



used for different types of graphical models of marginal independence (Lupparelli 



et al.[ |2009t [Rudas et al.[ |2010[ [Marchetti and Lupparelh[ |2011[ [Evans and Richard^ 



son 



2011). A disadvantage of marginal log-linear parameterizations is that their 



inverse mappings cannot be analytically computed (but for the special case of A). 



2.2 Marginal independence and bidirected graph models 

Marginal independence models aim to capture marginal independence relationships 
between variables. Dealing with marginal independence is especially challenging in 
the discrete case because pairwise independencies do not imply higher order indepen- 
dencies as, for instance, in the Gaussian case. In this subsection we focus on graphical 



models of marginal independence and, following Richardson (2003), we use the con 



vention that the independence structure of variables is represented by a bidirected 
graph. It is also worth recalling that graphical models of marginal independence 



have been previously discussed by Cox and Wermuth (1993), who adopt a different 



graphical representation with undirected dashed edges. 
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Figure 1: Example of bidirected graph with disconnected sets {1,3}, {1, 4}, {2, 4}, {1, 2, 4} 
and {1, 3, 4}, encoding the marginal independencies -^^{1,2} -LL-'^4 and Xi _LLX|3 4} under the 
connected set Markov property. 



A bidirected graph Q = (V, E) is defined by a. set V = {1, ... ,p} of nodes and a set 
E of edges drawn as bidirected; see Richardson (2003). Two nodes j, k are adjacent if 
jk is an edge of Q, whereas two edges are adjacent if they have an end-node in common. 
A path from a node j to a node A; is a sequence of adjacent edges connecting j and k 
for which the corresponding sequence of nodes contains no repetitions. For any subset 
D C V of nodes, D is said to be connected in Q if there is a path between every couple 
j, k & D, otherwise it is said to be disconnected in Q. Any disconnected set D C V 
can be uniquely partitioned into its (maximal) connected components Ci, . . . ,Cr such 
that D = Ci U ■ ■ ■ U Cr, Ci n Cj = (!) whenever i ^ j , Ci is connected for all i, and 
Ci U {v} is disconnected for all v G D\Ci. 

A bidirected graph model is the family of probability distributions for Xy that 
satisfy a given Markov property with respect to a bidirected graph Q. Given three 
disjoint subsets A,B,C C V, we use the standard notation Xa^Xb \ Xc (Dawid 



1979) to represent the statement "X^ is independent of Xb given Xc". The distribu- 



tion of Xy satisfies the pairwise Markov property if, for every missing edge jk, it holds 
that Xj iLXfc. The distribution of Xy satisfies the connected set Markov property 
(Richardson, 2003) if, for every disconnected set D, the subvectors corresponding to 
its connected components Xc^, . . . ,Xc^ are mutually independent. For instance, in 
the bidirected graph of Figure[l| the set {1, 2, 4} is disconnected with connected com- 
ponents {1, 2} and {4}. Then, under the connected set Markov property Xji 2} iLX 



'-4, 



implying X1ALX4 and X2-I-LX4. Notice that the connected set Markov property im- 
plies the pairwise Markov property, whereas the converse is not true in general, even 
for strictly positive distributions. We denote by B(^) the bidirected graph model for 
Xy defined by Q under the connected set Markov property. 



Parameterizations for the class B(^) have been studied by Drton and Richard- 



son 



(2008), where B(^) is parameterized by imposing non-linear constraints on the 
Mobius parameterization /i, and by Lupparelli et al. (2009), where B(^) is param- 
eterized by imposing linear constraints on the multivariate logistic parameterization 
rj] see also Evans and Richardson (2011). In the next section we introduce a novel 
parameterization for binary data, which we find useful to introduce a new family of 
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models, and we compare its features to the parameterizations reviewed above. 



3 Log-mean linear models 

We introduce a new class of models for the multivariate Bernoulli distribution based 
on the notion of Log-Mean Linear (LML) parameter, denoted by 7 = (7D)Dcy- Each 
element 7d of 7 is a log-linear expansion of a subvector, namely {^e)ecd, of the mean 
parameter: 

7D=Y1 (-1)"'^'" ^^SM^ (5) 

BCD 

SO that in vector form we have 

j = Wflogn. (6) 



Notice that by replacing /x with vr in (|6j) one obtains the canonical log-linear parameter 
A in g. 

It is worth describing in detail the elements of 7 corresponding to sets with low 
cardinality (its low-order interactions). Firstly, and trivially, 70 = log/i0 is always 
equal to 0. Secondly, for every j G V, the main LML effect 7{j} = log fi^jy is always 
negative, because is a probability. Then, for every j, k & V, the two-way LML 
interaction 

7{i,fc} = log 

/i{j}Ai{fc} 

coincides with the logarithm of the second order dependence ratio introduced by 



Ekholm et al. (1995); see also Ekholm et al. (2000) and Darroch and Speed (1983) 



where dependence ratios were used in models named Lancaster additive. Finally, the 
three-way LML interaction, for every triple j, k,z & V, is equal to 

, fi{j,k,z}fi{j}fi{k}f^{z} 
7{j,k,z} = log 

and thus differs from the third order dependence ratio; the same is true for each 7^ 
with \D\ > 3. Note that, already from two-way LML interactions, it is apparent that 



7 is not a marginal log-linear parameter of Bergsma and Rudas (2002) 



We now formally define the LML parameterization as a mapping from 11. 

Definition 1 For a vector Xy of binary variables, the log-mean linear parameteri- 
zation 7 is defined by the mapping 

7 = M^logZ7r, TT G n. (7) 
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Notice that ([T]) follows from ^ and Q. Although ([T]) is of the form of the marginal 
log-linear parameterizazion of Bergsma and Rudas| ( |2002 ) , in the latter the contrast 



and marginalization matrices corresponding to M and Z in ([T]) are rectangular ma- 
trices of size t X 2^ with t ^ 2^, so that the inverse transformation is not available 
in closed form. On the other hand, in our case the inverse transformation can be 
analytically computed by applying Mobius inversion twice to obtain 

7r = MexpZ^7, 7 G 7(0). (8) 

Clearly, the bijection specified by ([T]) and (|8| is smooth, so that it constitutes a valid 
reparameterization. Finally, we remark that, as well as the mean and multivariate 
logistic parameterizations, the LML parameterization satisfies the upward compati- 
bility property. 

Given a full rank matrix EI of size (2^ x k), with k < 2^, whose rows are indexed 
by the subsets of V, we define the LML model constrained by EI as follows. 

Definition 2 For a vector Xy of binary variables and a full rank (2^ x k) matrix EI, 
the log-mean linear model r(EI) is the family of probability distributions for Xy such 
that e^7 = 0. 

It is not difficult to construct a matrix EI such that r(EI) is empty. However, the 
family r(EI) is non-empty if the linear constraints neither involve 70 nor the main 
effect 7{j}, for every j G V. More formally, a sufficient condition for r(IHI) to be 
non-empty is that the rows of EI indexed hj D C V with |Z}| < 1 be all equal to 0; 
see Section HTTl 

Proposition 1 Any non-empty log-mean linear model r(EI) is a curved exponential 
family of dimension (2^ — A; — 1). 

Proof. This follows from the mapping defining the parameterization 7 being smooth, 
and the matrix EI imposing a /c-dimensional linear constraint on the parameter 7. □ 

Maximum likelihood estimation for LML models under a multinomial or Poisson 
sampling scheme is a constrained optimization problem, which can be carried out by 
using standard algorithms. In particular, we adopt an iterative method also used for 
fitting marginal log-linear models; see Appendix [C] for details. It should be noted that 
in our case the algorithm is computationlly more efficient than for marginal log-linear 
models, because, as remarked above, rectangular matrices of size t x 2^ with t 3> 2^* 
are replaced by square matrices of size 2^ x 2^. 

The LML parameterization, like the mean parameterization /i, is not symmet- 



ric under relabelling of the two states taken by the random variables (Drton and 
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Richardson, 2008). As a consequence, a key issue concerning the specification of EI is 

on variable coding and the forthcoming Sections |4] 



the possible dependence of F 
and m show how the matrix EI can be constructed so as to obtain useful models. 



4 Marginal independence models 

In this section we show that the LML parameterization 7 can be used to encode 
marginal independencies and, furthermore, that bidirected graph models are LML 
models. Hence, the LML parameterization can be used in alternative to the ap- 



proaches developed by Drton and Richardson (2008) and Lupparelli et al. (2009) 



Our approach is appealing because it combines the advantages of the Mobius param- 
eterization /i and of the multivariate logistic parameterization 77: the inverse map 
7 I— !■ TT can be analytically computed, as for fi, and the model is defined by means of 
linear constraints, as for rj. 

4.1 LML models and marginal independence 

The following theorem shows how suitable linear constraints on the LML parameter 
correspond to marginal independencies. 

Theorem 2 For a vector Xy of binary variables with probability parameter tt G n, 
let fx = 11(71) in ^ and 7 = 7(7?) in Then, for a pair of disjoint, nonempty, 

proper subsets A and B of V , the following conditions are equivalent: 

(i) Xa^Xb; 

(a) fiA'uB' = fJ'A' X fJ'B' for every A' C A and B' C B; 
(iii) 'jA'uB' = for every A' C A and B' C B such that A' ^ ij} and B' 7^ 0. 



Proof. See Appendix B.l 



□ 



We remark that equivalence (i) ^ (ii) of Theorem |2] follows immediately from The- 



orem 1 of Drton and Richardson ( 2008 ) . 



The next result generalizes Theorem |2] to the case of three or more subvectors. 

Corollary 3 For a sequence Ai, . . . ,Ar of r > 2 pairwise disjoint, nonempty, subsets 
ofV, letV = {D\D C AiU-'-UAr with D ^ Aifori = l,...,r}. Then Xa^, ■ ■ ■ , Xa^ 
are mutually independent if and only if {'~fD)Dev = 0. 



Proof. See Appendix B.2 



□ 
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A matrix H, such that T{M.) is the LML model defined by the constraints of Corol- 
lary [sj has size 2^ x \T)\; the rows of EI are indexed by the subsets of V while we 
index its columns by the elements of V. The model of mutual independence is then 
specified by setting Md^d = 1 for every D E V, and all the remaining entries to 



zero. In this case, clearly, variable coding is uninfiuential. Lupparelli et al. (2009) 
showed that Xa^ , ■ ■ ■ , Corollary [3] are mutually independent if and only if 

iVD)Dev = 0. On the other hand, Drton and Richardsonj (2008) showed that the 



same independence relationship holds if and only if /i^ = YYi=i f^Ainn for all D E V. 
Consider for instance two disjoint subsets A = {j,k} and B = {z}. In this case 
V = {{j,z},{k,z},{j,k,z}}, so that Xa -U- Xb if and only if 7{j-^} = 'j{k,z} = 
l{j,k,z} = 0. This corresponds to the LML model r{M.) defined by the 2^ x 3 full 
rank matrix EI whose columns are indexed by V and with all zero entries but for 
= ^{k,z},{k,z} = ^{j,k,z},{j,k,z} = 1- The same independence model can be 
defined by either the linear constraints ?7{j,2} = V{k,z} = V{j,k,z} = or the non-linear 
constraints 

= /^{i} ^ = f^{k} X f^{z} and f^{j,k,z} = l^{j,k} x ii{z}. 

An interesting special case of Corollary [3] is given below. 

Corollary 4 For a subset A (1 V with \A\ > 1, the variables in Xa are mutually 
independent if and only if = for every D C A such that \D\ > 1. 



Proof. It is enough to apply Corollary [3j by taking A = AiU ■ ■ ■ U Ar with \Ai\ = 1 
for every i = 1, . . . ,r. □ 



We stated in Section [3] that r(EI) is non-empty whenever the rows indexed hy D CV 
with |Z}| < 1 are equal to zero. This fact follows from Corollary |4| because the latter 
implies that for mutually independent variables Xi, . . . ,Xp the constraint El'''7 = 
is satisfied. 



4.2 Bidirected graph models are LML models 

It follows from Theorem |2] that the probability distribution of Xy satisfies the pairwise 
Markov property with respect to a bidirected graph Q = (V, E) if and only if 7{j,fc} = 
whenever the edge jk is missing in Q. The following theorem shows that bidirected 
graph models for binary data are LML models also under the connected set Markov 
property. 
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Theorem 5 The distribution of a vector of binary variables Xy belongs to the bidi- 
rected graph model B{Q) if and only if its log-mean linear parameter 7 is such that 
7d = for every set D disconnected in Q . 



Proof. See Appendix B.3 



□ 



For instance, let Q be the graph in Figure [T] Its family of disconnected sets is 



I? = {{1,3},{ 



1,4}, {2,4}, {1,2,4}, {1,3,4}}. 



and the bidirected graph model B(^) is defined by the hnear constraints 

7{1,3} = 7{1,4} = 7{2,4} = 7{1,2,4} = 7{1,3,4} = 0, 

corresponding to the marginal independencies X|i 2}-l-LX4 and Xi_HX|3 4}. 

5 Code-specific independencies and LML models 

We have shown in the previous section that useful LML models can be defined by 
specifying a collection of marginal independencies (like those encoded by a bidirected 
graph under the connected set Markov property). LML models are code dependent, 
in general, but variable coding is uninfiuential, as far as only marginal independence 
constraints are specified. In this section we show that interesting LML models can 
also be defined by specifying a collection of independence relationships on special 
conditional distributions which depend on variable coding; we call these relationships 
code-specific independencies and the models they define code-specific LML models. 
We then describe two possible applications of such models: the first application con- 
cerns the specification of parsimonious sub-models of bidirected graph models which 
are easily interpretable because they are fully defined by a collection of independen- 
cies, either marginal or code-specific; in the second application the choice of a suitable 
coding leads to a class of LML models focused on a specific sub-population of interest. 

5.1 Pivot cell and partial tables 

In order to define a coding of Xy it is necessary and sufficient to specify, for every 
f G the level of X^, that takes value 1. This is equivalent to fixing, among the 2^ 
cells of the probability table, the cell to be coded as ly; we call this cell the pivot cell 
and its probability Try the pivot probability of the coding. 
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Let U C V and denote by W the complement of U with respect to V, that is, 
W = V\U. The probabihty distribution of Xt/|{Xvi/ = iw} where iw G is 
determined by the conditional probability table 

P{Xu = iu\Xw = iw) = ^tt; , iu^^u- (9) 

Hence, the probabihty parameter characterizing the distribution of = iw} 

in ([9]) is obtained by extracting a subvector of vr that is then normahzed. We note 
that such a subvector of vr contains the pivot probabihty of the coding if and only if 
iw = liy and in the following we only consider conditional distributions of this kind, 
that is, of the form Xu\{Xw = Iw}- We denote by 7r^l^-^^=^> e 11^ the probability 
parameter of the binary variables X(7|{Xvy = li^}, whose entries are 

^u\{x^=i} ^ p^^^ ^ Id.Xu\d = Ou\d\Xw = Iw) = DCU. 

fiW 

Similarly, the mean and LML parameters are = fj,(7r^\^^^^^^) and 'yi^w='^} = 

'y(^Tc^\i^w=i}'j^ respectively, where in this case the mappings in ^ and ^ involve the 
matrices M*^ and Z^, because 7r'^l^^w'=i} ^ j^. follows that the theory developed 
in the previous section can be directly applied to specify independence models for the 
distribution of = Iw}- More specifically, an LML model for X^|{Xvi/ = Iw} 

is identified by a 2'^' x k full-rank matrix EI via the linear constraint M^^y^^^^^^ = 0. 

The main result of this section states that is a linear transformation of 7, 

so that every linear constraint in 'yi^w='^} jg equivalent to a linear constraint in 7. 

Theorem 6 Let tt E H be the probability parameter of a vector Xy of binary vari- 
ables. For a nonempty subset U C V, with W = V\U, let Tr^\i^w=i} ^ 
probability parameter of Xu\{Xw = liy}- Then, the following linear relationship 
between 7 = 7(77) and 71^1^=1} ^ ^(^^u\{Xw=i}^^ holds: 



^{X^=l} ^ ^ ^^^^^ ^^Q^ 



for every DCU with D 7^ 



Proof. See Appendix B.4 



□ 



Theorem [g] says that there exists a 2^ x 21^' matrix IK = IK'^ such that ^i^w='^} = K^7 
and consequently IHI^7"i^'''"^^'"^J' = if and only if (KIHI)'^7 = 0. Denoting by K,^/), for 
D (1 U, the column of K indexed by D (containing 2^ entries indexed by the subsets 
of V) the matrix K can be constructed as follows. First, we set K0^0 = 1 and all 
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other entries of K,^0 equal to zero, so that jI^'^ = = as required. Then, 

for every D CU with ^ 0, we set Ke,d = 1 if E = DUW with W C W, and 
^E,D = otherwise. In this way we can write 

^»,d1= Y1 ^^uiy (11) 
w'cw 



and (10) can be written in matrix form as 'yjf'^ = ]Kj^7. 



As a consequence of Theorem [6} it is possible to specify a bidirected graph model 
for X[/|{Xvi/ = Iw} by means of a linear constraint on the LML parameter of Xy. 
Specifically, we have the following result under the connected set Markov property. 

Corollary 7 Under the assumptions of Theorem^ for a bidirected graph Q = (U, E), 
the probability distribution of Xij\{Xv/ = 1^} belongs to the bidirected graph model 
B{Q) if and only if any of the following, equivalent, conditions are satisfied: 

(i) ^j^^^"-"^^ = for every set D disconnected in Q; 

w'<zw^DyjW' — for every set D disconnected in Q. 

Proof. This follows immediately from Theorems [5] and |6} □ 

Condition (ii) of Corollary [t] can be written in matrix form as (Kj^^7) oeVu = 0; 
where Vu is the family of disconnected subsets of ?7 (in ^). 

More generally, we can consider the collection of code-specific independencies, 
defined as the set of all independence relationships of the form Xa-^Xb\{Xc = Ic}, 
where A,B,C is an arbitrary tern of mutually disjoint, nonempty, subsets of V. 
Then, an immediate consequence of Theorem [6] is that code-specific independencies 
correspond to linear constraints on 7 and therefore to LML models for Xy- 

Corollary 8 // Xy is a vector of binary variables, then for every tern A, B, C of 
mutually disjoint, nonempty, subsets of V the following are equivalent: 

(i) XaALXb\{Xc = Ic}; 

(ii) li^uB'^^ = for every A' C A and B' C B such that A' ^^,B' ^ 0; 

(iii) Xlc'cc = for every A' C A and B' C B such that A' 7^ 0, 5' 7^ 0. 

Proof. The equivalence (i)-v^(ii) follows from Theorem |2| whereas the equivalence 
(ii)<(=v-(iii) holds true by Theorem |6} □ 

Condition (iii) in Corollary [S can be written in matrix form as (K^^,^^,)^7 = for 
every non-empty A' C A and B' C B. Corollary |8] shows that an LML model can 
be specified by any set of code-specific independencies and the rest of this section is 
devoted to two applications of this result. 
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5.2 Parsimonious code-specific bidirected graph sub-models 



In graphical models the number of parameters depends on sparseness of the graph. 
However, unlike other families of graphical models such as models for either undirected 
graphs or directed acyclic graphs, the number of parameters in a bidirected graph 



Richardson 


(2009 


) and 


Evans 



and Richardson (2011). As a consequence, even though bidirected graph models are 



recommended when the observed variables are jointly affected by unobserved variables 
(Richardson, 2003), an analysis restricted to the family of bidirected graphs may result 
in an overparameterized model. Hence, a convenient parameterization of bidirected 
graph models should allow the flexible implementation of marginal independence 
constraints together with additional substantive constraints leading to parsimonious 
sub-models. 

Parameterizations based on marginal log-linear parameters, such as the multi- 
variate logistic parameterization, can be used to specify parsimonious sub-models by 



setting higher order interactions to zero; see Lupparelli et al. (2009) and Evans and 



Richardson (2011). In this way, parsimonious modelling can be achieved, but the 



interpretation of the constraints not directly associated with marginal independence 
is difficult. On the other hand, not being able to specify parsimonious sub- models is 



perhaps the main drawback of the Mobius parameterization of |Drton and Richardson 
(2008), which is therefore not flexible enough for this task. 

The LML parameterization has the immediate advantage that there are two differ- 
ent ways in which it can be used to specify bidirected graph sub-models. On the one 
hand, it is still possible to set higher order interactions to zero, but the interpretation 
of such constraints remains difficult. On the other hand, it is possible to include 
additional constraints in the form of code-specific independencies. This is appealing 
because code-specific independencies have a straightforward interpretation, and fur- 
thermore the resulting model is an independence model in the sense that it is fully 
defined by independence relationships: (i) the collection of marginal independencies 
encoded by the bidirected graph under a given Markov property; (ii) a collection of 
code-specific independencies. 

One should observe that the coding is arbitrary, whenever all cells are on equal 
footing, and distinct codings will result in distinct models (based on distinct sets of 
code-specific independencies). Since this fiexibility originates an indeterminacy in the 
analysis, we find it appropriate to define a criterion for choosing the coding of the 
variables when all cells are on equal footing. Specifically, we propose the maximal 
count coding, which consists in setting as the pivot cell the cell of the table with 
the largest count. Compared to alternative codings, we expect to test code-specific 
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Figure 2: Simulation results: (a) distribution of the estimated significance level; (b) 
distribution of the estimated difference in power. 



independencies in partial tables with more observations, which results in increased 
efficiency. This aspect represents a further advantage of using the LML parame- 
terization for parsimonious bidirected graph modelling, compared to any marginal 
log-linear parameterization and, in particular, to the multivariate logistic one. 

This feature is illustrated through a series of simulations aimed to compare the 
performance of the multivariate logistic and LML parameterizations to achieve par- 
simonious models, the former by setting zero higher order interactions and the latter 
by specifying code-specific independencies. In particular, given a set of four binary 
variables indexed hj V = {A, B, C, D}, we compared the performance in testing the 
hypothesis r^y = in the multivariate logistic parameterization with the performance 
in testing the hypothesis Xc ALXd\{Xa = I.Xb = 1} in the LML parameterization 
under the maximal count coding. Both hypotheses correspond to a single linear con- 
straint and both are implied by the conditional independence Xc 1\- Xi:)\{Xa, Xb}- 
We remark that the choice of the coding is uninfluential for testing zero multivariate 
logistic interactions. 

We generated a sequence of probability vectors tTj, i = 1, . . . , 40, uniformly over the 
simplex, and satisfying the constraint Xc Jl X^ \ {Xa, Xb}- Then, for each probability 
parameter tt^, we sampled 5.000 multinomial random vectors rij, j — 1, . . . , 5.000, of 
size N. Finally, for each random sample rij, we tested the two above hypotheses 
at the a — 0.05 nominal significance level, using the deviance of the corresponding 
model and the chi-squared distribution with one degree of freedom. 
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For each probability parameter vTj, we estimated the actual (finite sample) sig- 
nificance level al and of the two tests through the percentage of rejected mod- 
els in the 5.000 random samples, thus obtaining two distributions of estimates over 
the conditional independence model. The whole procedure was repeated for = 



50, 100,400, 1600 and, for every sample size. Figure 2(a) compares the two box plots 
of the estimated significance levels. The plot clearly shows a lower variability in the 
estimated significance level and a faster convergence to the nominal value (0.05) for 
the test on the LML parameter. 

We also compared the two tests in terms of power. To this aim, we replicated the 
same series of simulations using a sequence of unconstrained probability vectors vTj, 
2 = 1,..., 40, generated uniformly over the simplex. Then, for every i = 1, . . . , 40 we 
estimated the type II error of the two tests, and /37, through the percentage of 
accepted models in the 5.000 random samples and, from these, the difference in power 



5i = P]' — . Figure 2(b) gives, for every sample size = 50, 100, 400, 1600, the box 
plot of the quantities 5j, i = 1, . . . , 40. The plot shows a clear gain in power for the 
test based on the LML parameter with respect to the test based on the multivariate 
logistic parameter. 

Our simulation results suggest that the property of LML models of being code- 
dependent can be exploited to improve the efficiency of model selection. Therefore, 
LML models can be used to search parsimonious bidirected graph models in a lower 
dimensional space defined by constraints which are both interpretable and can be 



tested with powerful inferential procedures. In Section 6.1 we present an effective 
application of models defined by both marginal and code-specific independencies, 
under maximal count coding. 

5.3 Pivotal code-specific LML models 

In this subsection we describe an application of LML models that fully exploits the 
advantages deriving from their code dependent nature. In statistical applications it is 
often of interest to investigate the behaviour of small sub-populations. For instance, 
one of the challenges of modern medicine is personalized medicine, in which therapies 



are tailored to the exact biological state of an individual; see Nicholson (2006). From 
a statistical viewpoint, this involves the stratification of individuals according to some 
features Xy so as to identify a sub-population of interest, which in our perspective 
corresponds to a single cell of the cross-classified table. Researchers are then interested 
in statistical models that may highlight relevant features of such a sub-population, 
and in this context LML models provide a powerful tool of analysis. More specifically, 
if the variables are suitably coded, LML models allow one to focus on the probability 
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that a unit of the population belongs to the sub-population of interest and to specify 
a rich class of factorizations for such probability. 

The first step of the analysis requires that the levels of Xy be coded so that the sub- 
population of interest corresponds to the pivot cell of the table and, accordingly, the 
probability of belonging to such sub-population is the pivot probability of the coding. 
Indeed, the key feature of LML models is that the cells of the table are not on an equal 
footing. The pivot cell has a central role and every code-specific independence implies 
a factorization of either the pivot probability or of a probability in a marginal table 
corresponding to a cell that contains the sub-population of interest. For instance, the 
code-specific independence Xa^Xb\{Xc = Ic}, with possibly C = 0, implies 

pfv _i . _ PiXAuc = '^auc)P{Xbuc = '^Buc) /-.^x 

^[-^AUBUC — ^AUBUC) — T^TTP Z — ^ ) U^J 

-r (^Ac = i-cj 

where we use the convention that P{Xc = Ic) = 1 for C = 0. lfAUBUC = V, 
then P{Xaubuc = ^aubuc) = vry = /iy is the pivot probability, but also the case 



AU B U C C\^isof interest, as clarified in the application of Section 6.2 



We remark that information on the above factorization of the pivot probability 
could also be obtained by fitting a graphical model to the data, because the conditional 
independence relation Xa-^Xb\Xc implies Xa-^Xb\{Xc = Ic} and, in turn, the 
factorization in (12). However, the converse implication does not hold and therefore 
LML models can encode factorizations of the pivot probability that cannot be encoded 
by any graphical model. Furthermore, consider the case where different factorizations 
of the pivot probability are suggested by different classes of graphical models. In this 
case such factorizations can be encoded into a single LML model, which can then be 
used to investigate to what extent they may simultaneously hold. 



6 Applications 



In this section we describe two applications of LML models. The first one refers 



to Section 5^ and shows how LML models can be used to identify interpretable 
parsimonious bidirected graph sub-models. The second concerns the selection of a 



pivotal code-specific LML model as described in Section 5.3 



6.1 Coppen data 

Table [l] shows data from Coppen ( 1966 ) for a set of four binary variables concerning 
symptoms of 362 psychiatric patients. The symptoms are: Xi = stability (0 = 
extroverted, 1 = introverted); X2 = validity (0 = energetic, 1 = psychasthenic); 
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Table 1: Data from 



Coppen 



( 1966 ) on symptoms of psychiatric patients. 



X3 1 

X2 Xi 1 1 

1 30 22 22 8 

1 32 16 27 12 

15 23 25 14 

1 9 14 46 47 



X3 = acute depression (0 = yes, 1 = no); X4 = solidity (0 = hysteric, 1 = rigid). 
The maximal cell coding has been applied. 

We analyse these data because they have been analysed twice, in the graphical 
modelling literature, with two different classes of models: Wermuth (1976) anal- 
ysed them by means of conditional independence models, and found the model in 
Figure [Sj^a). More recently, Lupparelli et al. (2008) analysed the same data using 
marginal independence models, and obtained the model in Figure |3](b). Both the 
undirected 4-chain in Figure [sj^a) and the bidirected 4-chain in Figure [3]^b) achieve a 
good fit, but they encode different independencies. 

Since there is no probability distribution for binary variables that simultaneously 
satisfies all and only the (conditional) independencies encoded by the two graphs in 
Figure |3} it is not possible to reconcile the different results of the two analyses without 
adding further independence relationships. However, such an addition would make 
little sense here, because it would lead to a model with poor fit. LML models allow 
us to follow a different approach: by including in a single model all the marginal 
independencies encoded by the bidirected graph and the code-specific independencies 



1 2 
• • 

• • 

4 3 

(a) (b) 

Figure 3: Independence models for Coppen data: (a) the 4-chain undirected graph 
model XiXX|3,4||X2, X4XX|i,2}|X3; x%) = 13-9 (p- value = 0.09), BIC = -33.26; 
(b) the 4-chain bidirected graph model Xi_LLX{3^4}, X4_LLX{i_2}; X(5) = 8-6 (p- value 
= 0.13), BIC = —20.85. BIC values with saturated model as baseline [BIC sat = 0). 
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suggested by the undirected graph and the chosen variable coding, we partially recover 
the conditional independencies encoded by the undirected graph in a bidirected graph 
sub-model. 

By Theorem |5] the bidirected 4-chain represents an LML model, namely, the model 
given by the constraints 713 = 714 = 724 = 7134 = 7124 = 0. We can simplify this 
model by exploiting the conditional independencies encoded in the undirected 4-chain 
to include further independence constraints in some partial tables: 

XiXX{3,4}|{X2 = l} and X{i,2}XX4|{X3 = 1}. 

As a consequence of Theorem [6} these additional independencies correspond to six 
sum-to-zero constraints on the LML parameter. However, because of redundancies 
with the marginal independence model, these independencies can be included by 
adding only three linear constraints: 

7l3 + 7123 = 0, 7134 + 71234 = and 724 + 7234 = 0. 

Together with the original constraints, these amount to assuming 7123 = 71234 = 
7234 = 0. The resulting LML model fits the data with xfs) ~ 11-45 (p- value = 0.18) 
and BIC = —35.68, thus improving on both models in Figure |3] 



6.2 HIV data 



Table 2 contains data from Guaraldi et al. (2011 ) for four binary variables observed on 



2 860 HIV-positive patients: H = hypertension (0 = yes, 1 = no); E = cardiovascular 
event (0 = no, 1 = yes); A = age (0 = greater than or equal to 45, 1 = less than 45); 
G = gender (0 = female, 1 = male). In this application, researchers are interested 
in the sub-population of young males without hypertension having had some kind of 
cardiovascular event (such as a stroke or a heart attack). For this reason we have 
coded the variables in such a way that this sub-population corresponds to the pivot 
cell (1, 1, 1, 1) of Table |2| which only contains 6 observations. 

Cardiovascular events among HIV-positive young men with no hypertension are 
much less rare than cardiovascular events in the corresponding HIV-negative sub- 
population. Consequently, researchers are interested in the effect of specific HIV- 
related risk factors, such as CD4 counts, drugs and length of HIV infection, on the 
response probability nv- This can be assessed by testing the hypothesis that the 
logistic regression coefficient corresponding to a given risk factor is equal to zero, 
possibly in the presence of other covariates. However, as routinely occurs in per- 
sonalized medicine, the sub-population of interest is very small and it is well-known 
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Table 2: Data from 
pivot cell is in bold. 



Guaraldi et al. 



(2011) concerning HIV-positive patients. The 



A 1 

H E G 1 1 

93 320 33 120 
1 4 51 3 10 

1 296 614 626 665 
1 3 13 3 6 



that the sample size required to achieve an adequate power in such a test increases 
as vTy gets closer to zero; see Agresti (2002, Section 6.5.2) and Whittemore (1981). 
More concretely, Peduzzi et al. (1996) provided guidelines on the minimum sample 
size required in terms of events per variable, and suggested that in a logistic regres- 
sion analysis there should be at least 10 events for every covariate. According to this 
rule, the 6 events observed in the pivot cell of Table [2] are not sufficient to reliably 
identify the effect of any risk factor, and it becomes of interest to assess whether it 
makes sense to focus on cells of marginal tables with higher number of observations. 
For this reason, we use LML models to investigate the independence structure of the 
sub-population corresponding to the pivot cell and find simplifying factorizations of 
the pivot probability. 

We considered the distribution of = 1} for every U indexing a subvec- 

tor of Xy = {H,E,A,G) such that W = V\U, with \W\ = 1,2, and for each such 
distribution we checked for marginal independencies, that is, for code-specific inde- 
pendencies of Xv- We obtained in this way the code-specific independencies listed 
in the third column of Table [3] Then, we fitted LML models starting from the sat- 
urated model and introducing each code-specific independence of Table |3| from top 
to bottom, in a stepwise manner, rejecting the code-specific independencies leading 
to a small p- value/increase in BIG. The last three columns of Table |3] report the 
deviance, the degrees of freedom and the BIG index of each model r(EI) taken into 
consideration. An adequate fit with xf4) = 7.71 (p- value = 0.10) is achieved by the 
LML model that jointly satisfies the code-specific independencies 

HAL{A,G)\{E = 1} and EALG\{A=1}. 

Hence, according to the selected LML model, the joint probability of the pivot cell 
can be factorized as 

T^HEAG = (1'3J 
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Table 3: Marginal independence models for the partial distributions associated to 
= 1} and stepwise search of an overall LMLmodel for the full vector Xy- 
The BIG values are computed with the saturated model as baseline {BICsat = 0). 
The term present means that the code-specific constraint is already contained in the 
selected model. 
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and, moreover, the pivot probability induced on the marginal table of {E,A,G) can 
be factorized as 

TT+EA+ >^ T^++AG /..x 
T^+EAG = • (14) 

Notice that, for the sake of concreteness, here we slightly changed our indexing 
notation, so that for instance ttheag = P{H = 1,E = 1,A = 1,G = 1) and 
nHE++ = PiH = l,E = l). 

Since the factorizations (13) and (14) hold simultaneously, we can replace ti+eag 
in ( 13 ) with ( 14 ) and show that the selected LML model is defined by the factorization 

ttheag = (15) 

corresponding to the log linear expansion of fiy = ^ry 

log TChEAG = Id + IH + lE + 7a + IG + iHE + lEA + 7AG, 



where the missing interactions are due to the fact that the factorization ( 13 ) is encoded 
by the hnear constraints '^hag + Iheag = 0, 'Jha + Ihea = and 'Jhg + Iheg = 0, 
whereas the factorization (14) is encoded by the linear constraint 'Jeg + Ieag = 0. 

It follows from our analysis that the pivot probability can be written as a func- 
tion of probabilities corresponding to sub-populations that are larger than the sub- 
population of interest. More specifically, the cells corresponding to nHE++, 'n'+EA+ 
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and T^++AG contain 25, 22 and 801 events, respectively, and it can thus be helpful to 
investigate the role of risk factors in these sub-populations. 

We remark that the amount of simplification of the pivot probability achieved by 
(15) cannot be achieved by applying ordinary graphical models. Indeed, we analysed 
the data with bidirected graph models, but only the saturated model provided a good 
fit. We also analysed the data with undirected graph models, and the only model 
with a good fit was the model E }LA\{H,G), with X(4) = 8.45 (p-value = 0.08). The 
latter model identifies the code-specific independence E ALA\{H = 1,G = 1}, which 
is also identified by our procedure, although it is not included in the selected model. 



7 Extension to non-binary categorical data 

This paper is devoted to the theory of LML models for binary variables, nevertheless 
it is useful to discuss briefly the extension of these models to the case where the 
variables have an arbitrary number of levels. The main point is that LML models are 
intrinsically binary in the sense that their generalization can be conceptually obtained 
by means of a sequence of dichotomizations of the non-binary variables and then by 
iteratively applying the procedures developed for the binary case. 

Assume that Xy = {Xy)y^v is a discrete random vector with X^ taking values in 
{0,1,..., dy} so that the state space of Xy is Zy = X {0,1,..., d^}. For every 
V E V and i = iy E introduce the Bernoulli random variable Xv'^ defined as 



1 if Xy = z„ 
otherwise 



where, for D C V, is the subset of the levels of i taken by the variables in X^ 
and, accordingly, is the level of X^. In this way, for every i E Xy, the random 
vector Xy^ follows a multivariate Bernoulli distribution with probability parameter 
7r(*) = ['K^j^)£)cy and mean parameter /x^*^ = Ztt*^*). We remark that here tt^^ = 

^V\D 



P{X^^ = lD,Xy}j^ = Oy\D) and, since xjj^ = 1^ if and only if Xd = then the 



probability table of Xy can be written as {vTy^jjgj^. 

Let {1, . . . , (it,} be the state space of X^ with the level "0" removed. Drton (2009 
eqn. 3.3) considered the restricted state space Jn = XveD {1, • • • , dy} for D C and 
defined the saturated Mohius parameters as the collection of marginal probabilities 
given by 

P[Xd = 3d) for every jo G and D C ^ with D^ib. 



Drton (2009, Lemma 7) showed that there exists a bijective linear map from the cell 



probabilities to saturated Mobius parameters and provided a closed form expression 
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for the inverse map. Furthermore, Drton (2009, Theorem 8) generahzed to the non 



binary case the connection between bidirected graph models and factorization of 
saturated Mobius parameters. It is straightforward to see that /i^^ = P{X^^ = 
Id) = P^Xd = jo) and therefore, in our notation, the saturated Mobius parameters 
can be written as the collection of vectors 

11^^^ for every j = jy G Jv- (16) 



The representation in ( 16 ) is redundant because for every j, / G Jv such that jr, = j'j^ 



it holds that /i^^ = However, it is very useful because it writes the saturated 

Mobius parameters as a collection of Mobius parameters for binary variables. We 
can thus parameterize the distribution of Xy by means of the collection of LML 
parameters defined as 

^(J) = log for every j & Jy (17) 

and, as a consequence, several properties of the LML parameters for the non-binary 
case follow immediately form the iterative application, for every j G Jy of the corre- 
sponding properties for the binary case. 

For instance, let A and B be two disjoint, proper, subsets of V . Then, Xa^Xb 
if and only if xjf'^ AL X^^^ for every j G Jy\ see also |Drton| ( |2009| Theorem 8). Hence, 
by iteratively applying Theorem [2] for all j G Jy one obtains that X^ -LL Xb if and 
only if 7^^,^^, = for every A' <Z A and B' C B such that A' ^ ^ and fi' 7^ and for 
every j e Jy. 

In a similar way, it is possible to generalize the definition of code-specific inde- 
pendencies by iteratively applying Theorem |6| However, a formal generalization of 
this idea is less straightforward and it is deferred to a future paper. Here, it is worth 
remarking that a vector 7'^-'^ is directly associated with the cell j G Jy thereby mak- 
ing it possible to focus on the factorization of the probability TTy\ In other words, 
every cell j G Jy is a pivot cell of the table. The LML parameters satisfy the upward 
compatibility property also in the non-binary case but, interestingly, in this case they 
also satisfy the additional property that they remain unchanged if some levels of the 
variables are merged to the level "0" . 



8 Discussion 

We have introduced the LML parameterization for binary data and described two 
areas of application of the family of models defined by imposing linear constraints 
on such parameters. We deem that the full potential of these models is shown in 
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applications where the main interest is for a sub-population corresponding to a given 



cell of the table as illustrated in the analysis of the HIV data in Section |6.2[ The gen- 
eralization of this kind of applications to non-binary categorical variables is promising 
because it allows one to focus simultaneously on more than one pivot cell. 

The second area of application concerns bidirected graph models where LML 
models are shown to provide a worthwhile alternative to marginal log-linear models. 
The advantages concern the efficiency of the algorithm for the computation of MLEs 
but, more importantly, the possibility to identify interpretable parsimonious sub- 
models by searching for code-specific independencies. In this kind of application, 
variable coding is arbitrary and distinct codings will typically result in models defined 
by different sets of code-specific independencies. However, it is not uncommon for 
a model selection procedure to produce results that, to some extent, depend on the 
way it is implemented, and we deem that the interpretability of the selected model 
and efficiency of testing procedures are of great importance and worth some sacrifice. 
Furthermore, when the sample size is small (compared to the dimension of the table) 
and therefore the distribution of the test statistics is not well-approximated by its 
asymptotic version, then the coding can be driven by cell counts in such a way that 
code-specific independencies are searched in partial tables with larger cell counts 
thereby improving the asymptotic properties of statistical procedures. 

In the recent statistical literature, a certain degree of attention has been posed on 
models encoding conditional independencies in partial tables. H0jsgaard (2004) intro- 
duced the family of context specific interaction models which generalize the theory of 
log-linear models for contingency table so as to make it possible to search for pairwise 
conditional independencies in partial tables. Context specific interaction models are 
not code-specific but they are not suited to deal with constraints defined on marginal 
distributions. Huang and Frey (2011) developed the theory of Cumulative Distribu- 
tion Networks (CDNs) which, for certain graph structures, encode the same marginal 
independences as the bidirected graph with the same connectivity, together with some 
additional conditional independence constraints. In the case of CDNs defined over 
ordered discrete variables, such additional conditional independencies include also 
the so called min-independence that is a special case of code-specific independence. 
Unlike CDNs, in LML models the collection of code-specific independencies specified 
by the model is arbitrary and not constrained by the graph structure. Nevertheless, 
it would be of interest to investigate whether in the binary case CDNs can be defined 
subclass of LML models. 

The applications of Section |6] implement naive model selection procedures, but the 
general issue of developing model search strategies for the exploration of independence 
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structures in partial tables is still open, and would be crucial to deal with large tables. 
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A Basic lemmas 

Lemma 1 For any set D $ it holds that Y^ecd (-1)'^' = Y^ecd (-l)'^^^' = 0. 

Proof. It is well-known, and can be proven by induction, that any non-empty set D 
has the same number of even and odd subsets. □ 
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Lemma 2 Let g{-) be a real-valued function defined on the subsets of a set D. If two 
nonempty, disjoint, proper subsets A and B of D exist, such that AU B = D and 
g{E) = g{E n A) + g{E n B) for every E C D, then J2ecd (-1)1^^^! giE) = 0. 

Proof. If we set h = XIscd (—1)'^^'^' giE), then we have to show that h = 0. Since 
A and B form a partition of D, we can write 

A'CA B'CB 

where A' = E n A and B' = E n B. Then, from the fact that AnB = A' n B' = 
A' nB = B' nA = ^ it follows both that (-1)I(^ub)\(a'ub')I = x (-1)\b\b'\ 

and that g{A' U B') = g{A') + g{B'). Hence, we obtain 

h = {9{A') + g{B')} 

A'CA B'CB 

A'CA B'CB 

A'CA I B'CB B'CB ) 

By assumption i? 7^ 0, so that Lemma [l] implies XIb'cbI^-'-)'^^^'' ~ ^ ^^"^ thus 

^ = E (-1)'^'"'' (E(-i)'"'"''^(^')V 

A'CA I B'CB J 

Since we also have A 7^ 0, Lemma [l] also implies that XIa'ca (—1)'^^^ ' = and 
therefore that h = 0, as required. □ 



B Long proofs 

B . 1 Proof of Theorem H 

We first show (i)<^(ii). The implication (i)^(ii) is straightforward. To prove that 



(i)<^=(ii) we use the same argument as in the proof of Theorem 1 of Drton and Richard- 



son 



(2008), which for completeness we now give in detail. 



We want to show that for every iaub € ^aub it holds that 

P{Xaub = Uub) = PiXA = ia)P{Xb = ib) (18) 
and we do this by induction on the number of Os in lAyjB-, which we denote by k, 



with < k < \AU B\. More precisely, point (ii) implies that the factorization (18) is 
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satisfied for k = 0, also wlien A and B are replaced with proper subsets, and we show 
that if such factorization is satisfied for every k < j < \AU B\ then it is also true for 
k = j. Since j > 0, there exists v & AU B such that iv = and, in the following, we 
assume without loss of generality that v E A, and set A' = A\{v}. Hence, 

P{^AUB = iAus) = P{Xa"JB = iA'UB) — P{Xa'UB = ^A'VJB^Xy = 1) 

= P{Xa' = tA')P{XB = ib) - P{Xa' = tA',X, = 1)P{Xb = ts) 

= {PiXA' = tA') - P{Xa' = lA'.X, = 1)}P{Xb = ib) 

= P{Xa = ia)P{Xb = Ib) 

as required; note that the factorizations in the second equality follow from (ii) and 
the inductive assumption, because the number of Os in ia'vjb is J — 1, and furthermore 
that for the case A' = we use the convention P{Xa' = ^a') = 1 and P{Xa' = 
= 1) = P(X, = 1). 

We now show (ii)<(=^(iii). The implication (ii)^(iii) follows by noticing that 70 = 
J2ecd g{E), where g{E) = log fiE- Hence, if we set D = A' U B', with A' 

and B' as in (iii), the statement in (ii) implies that for every E ^ D 

g{E) = log fiE = log fiA'HE + logfiB'nE = giA' nE)+ g{B' n E) 

so that the equality 7d = follows immediately from Lemma |2j We next show that 
(ii)<^=(iii) by induction on the cardinality of AU B, which we again denote by k. 

We first notice that the identity fiAuB = fJ-A >^ fJ'B trivially true whenever either 
v4 = or 5 = because /i0 = 1. Then, if \AU B\ = 2, so that \A\ = \B\ = 1, 
lAuB = implies fiAuB = x /^b as an immediate consequence of the identity 
7aub = log ji^ji^- Finally, we show that if the result is true for 1^4 U i?| < k then it 
also holds for 1^4 U 5 1 = k. To this aim, it is useful to introduce the vector /i* indexed 
hj E C AU B defined as follows: 



fiE for E C AU B; 

fj.A>^ f^B for E = AU B. 



Condition (iii) is recursive and, therefore, if it is satisfied for A and B then it is 
also satisfied for every A' (1 A and B' ^ B such that \A' U B'\ < k, that is, such 
that A' U B' G A U B. As a consequence, the inductive assumption implies that 
fJ'A'uB' = fJ'A' >^ f^B' for every A' (1 A and B' C B such that A' U B' ^ A U 5, and 
this in turn has two implications: firstly, we only have to prove that (iii) implies 
Aiyius = /^A X [iB] secondly, we have ^eclaxjb {-l)^^^^^^^^^ logli*E = by Lemma 
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Hence, we can write 



ECAUB 

EcAUB 



ECAUB 

log /UAUB - log fiA - log yUiJ (19) 



and since (iii) implies that 7aub = then (19) leads to fiAuB = I^a^ I^b, and the 
proof is complete. 

B.2 Proof of Corollary |3| 

For i = l,...,r, we introduce the sets A^i = [jj^^Aj and Vi = {D\D C Ai U 
A_i, with both D n 7^ and D n A^i ^ 0} and note that, by Theorem |2| Xa, AL 
. if and only if 7^) = for every D EVi. The mutual independence Xa^ iL ■ ■ ■ _LL 
Xa^ is equivalent to Xa^ -LLX^ . for every i = 1, . . . ,r and, by Theorem [2| the latter 
holds true if and only if 7/) = for every D G IJi=i ^i- Hence, to prove the desired 
result we have to show that V = IJI=i 

It is straightforward to see that C "D for every z = 1, . . . , r, so that I) D IJI=i 
The reverse inclusion V C IJ^^^^ can be shown by noticing that for any D E V 
one can always find at least one set Ai such that D Ci Ai 7^ 0; since D ^ Ai hj 
construction, it holds that D fl A^i 7^ and therefore that D E Vi. Hence, we have 
D G IJi=i every D eV, and this completes the proof. 

B.3 Proof of Theorem [5] 

Every set D C that is disconnected in Q can be partitioned uniquely into inclusion 
maximal connected sets Di, . . . , Dr with r > 2. It is shown in Lemma 1 of 



Drton 



and Richardson (2008) that vr G B(^) if and only if Xj^^ _LL • ■ ■ _LL Xfj^ for every 
disconnected set D <ZV. Hence, it is sufficient to prove that the mutual independence 
_LL • ■ ■ _LL X^,^ holds for every disconnected set D in ^ if and only if 715 = for 
every disconnected set D in Q. 

We assume that D = Di U • • • UZ)^ is an arbitrary subset of V that is disconnected 
in Q and note that, in this case, also every set C U ■ ■ ■ U D^. such that E ^ 
for every z = 1, . . . , r is disconnected in Q. Then, if _LL • ■ • ALXj^^ it follows from 
Corollary |3] that also 71) = 0. On the other hand, if every element of 7 corresponding 
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to a disconnected set is equal to zero, then 7^ = for every E C DiU- ■ -UDr such that 
E % Di for every i = 1, . . . , r and, by Corollary 3 , this implies that _LL ■ ■ ■ _LLX^^. 

B.4 Proof of theorem [6] 

For every pair of subsets D (1 U and W' ^ W = V\U it follows from the definition 
of LML parameter in ^ that 

ECDUW 

which is equivalent to 



J2 iQg^^l i^ECDU W) (20) 

'CDUW 

where l(-) denotes the indicator function. Hence, if we define uj{E, D U W) = 



iDUW — 

ECDUW 



^iy{Duw')\E\ X 1{E C DUW), then we can rewrite (20) as 



ECDUW 



Accordingly, the quantity XlvF'cvi/ Iduw in ( [lO| ) can be written as 

W'CW W'CW ECDUW 

= J2 ^ogfiE J2 ^{E,DUW'). (21) 

ECDUW W'CW 

We now focus on the internal sum of (I2TI), 



J2 co{E,DUW') = ^ (-l)l(^^^')\^l X l(^CZ}Uiy')- (22) 



W'CW W'CW 



Since E, D and W in (22) are fixed and such that DdW = ^ and E (1 DUW, 



we can depict the structure of these sets as in Figure |4j The sum on the right hand 



side of (22) is taken over all subsets W C W but its terms are different from zero if 
and only if W is such that E (1 D U W, that is, if and only if there exists a subset 
H C W\E such that W = {E\D) U H. Formally, the following equality between sets 
holds, 

{W'CW\ ECDUW'} = {W = {E\D)UH\ H CW\E}, 
and, consequently, we can write (p2|) in the form 



J2 oj{E,DUW') = Yl (-l)l^^^(^\^)^^>\^l. 

W'CW HCW\E 
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D 



W 




\ E 



Figure 4: Subset structure in (22). 



It is easy to see that {D U {E\D) U H}\E = {DU H)\E and from the fact that both 
EnH = ^ and Dr\H = ^ii follows that {D UH)\E = {D\E) U H and, furthermore, 
that = X (-1)1^1. We can thus write 

ooiE,DUW') = Yl (-1)1(^^^)1 X (-1)1^1 

W'CW HCW\E 

= Y (-1)'^'. 

HCW\E 

Now the sum 

J2hcw\e (-1)'^' is equal to 1 for W\E = 0, that is, for W C E, and 
it is otherwise equal to by Lemma [Tj Hence, the final form of (22) is 



(23) 



W'CW 



Substituting ([23j) in (21) leads to 

Y log/i^(-l)l(^\^)ll(iyci?) 



W'CW 



ECDUW 



|D\(FUTy)| 



FCD 



where F = ^ n £>, and since n = it holds that D\{F UW) = D\F so that 

(24) 



W'CW 



FCD 

We have assumed D 7^ and, in this case, by Lemma [T] 

FCD I FCD J 



(25) 



so that we can subtract rt25p from ( 24 ) and complete the proof as follows: 



FCD FCD 
log ^^^^ 



W'CW 



E (-1)' 



FCD 
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C Algorithm for ML estimation in LML models 



Let = {no)Dcv be a vector of cell counts observed under multinomial sampling 
from a binary random vector Xy with probability parameter vr^ > 0. If we denote 
by = Nn^ the expected value of n^, where = l'''n^ is the total observed 
count (sample size) and 1 is the unit vector of size i?^'^' . We can deal with maximum 
likelihood estimation of vr^ by considering rt^ as coming from Poisson sampling with 
parameter ip^ > and, in this case, we will find "ipy = N and N~^-i}j^ = . Thus, 
using the reparameterization = logtp^ to remove the positivity constraint on ip'^ , 
we can write the log-likelihood function (up to a constant term) as 

i{uj; n) = u — l'^ exp{u) , i?^'^', 

where n = and u = . 

The LML parameter 7 is obtained from u through the reparameterization 7 = 
M""" log{Zexp(a;)}, u G i?^'^', so that the linear constraint on 7 defined by El'^7 = 
can be transformed into the following non-linear constraint on u: 

g{u) = m^Wl^ log{Zexp(w)} = 0. 

Maximum likelihood estimation in the LML model defined by EI can thus be formu- 
lated as the problem of maximizing the objective function i{u; n), with respect to u, 
subject to the constraint g{u) = 0. 

A well-known method for the above constrained optimization problem looks for a 
saddle point of the Lagrangian function i{u);n) + Tg{u), where r is a fc-dimensional 
vector of unknown Lagrange multipliers, by solving for u and r the gradient equation 

d£(u;n) dqiuo) 

V + 4r^r = 

together with the constraint equation g{uj) = 0. If a) is a local maximum of £{u:\n) 
subject to g{ijj) = 0, and dg{uj)/duj is a full rank matrix, then a classical result ( [Bert 



sekas , 1982 ) guarantees that there exists a unique f such that the gradient equation is 



satisfied by (a), f). In the following we assume that the maximum likelihood estimate 
of interest is a local (constrained) maximum. 

The gradient equation requires that the gradient of i, that is, the score vector 

di{u};n) 

s[uj; n) = — = n — exp^w), 

000 

be orthogonal to the constraining manifold defined by g{u)) = 0, that is, belong to 
the vector space spanned by the columns of 

^, , dq(u) SiZexpfw)} 91og|Zexp(co')l 

= diagexp(a;) Z^[diag{Zexp(w)}]"^Me, 
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where diagf is the diagonal matrix with diagonal entries taken from the vector v. 
We remark that G{u) has full rank, for all u G -R^'^', because EI has full rank by 
construction. 

Since no closed-form solution of the system formed by the gradient and constraint 
equations is available (in our case) we resort to an iterative procedure inspired by 



Aitchison and Silvey (1958) and Lang (1996). Specifically, we use the Fisher-score- 



like updating equation 



CO 



t+i 



-t+i 







+ 



-G(a;*) 




s{uj^; n) 



to take step t + 1 of the procedure, where and r* (unused) are the estimates of u 
and T (respectively) at step t, and ¥{u) is the Fisher information matrix 

ds{u!] n) 



¥(uj) = -E 



dcu 



-i^l— diagexp(a;)} = diagexp(ci;) 



at cu E R'^ . The above updating equation is obtained using a first order expansion 



of s{u] n) and g{uj) about w*; see Evans and Forcina (2011) for details 



The matrix inversion in the updating equation can be solved block-wise as follows 
(Aitchison and Silvey [1958): 



F(a;*) -G(tJ* 
-Giu'V 



-P 



-1 



where 



Q = -¥{J)-^G{J)¥~\ 

M = F(w*)-i +F(u;*)-iG(w*)Q^. 

Then, introducing the relative score vector 

e{iJ]n) =¥{iJ)~^s{iJ]n) = {diagexp(w*)}"^{n - exp(w*)}, 

the updating equation can be split and simplified as 

^t+i ^ -¥-^{G{ujye{uj';n)+g{u')}, 



t+i 



J + e{J- n) + ¥{J)-^G{jy-^\ 



so that the instrumental role of Lagrange multipliers becomes apparent, and it is clear 
that the algorithm actually runs in the space of u. Notice that the updates take place 
in the rectangular space R^'^ ', so that there is no risk of out of range estimation. 
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Since the algorithm does not always converge when the starting estimate w° is not 
close enough to a), it is necessary to introduce a step size into the updating equation. 
The standard approach to choosing a step size in unconstrained optimization problems 
is to use a value for which the objective function to be maximized increases. However, 
since in our case we are looking for a saddle point of the Lagrangian function, we need 



to adjust the standard strategy. Specifically, Bergsma (1997) suggests to introduce a 



step size in the updating equation for w, which becomes 



with < step* < 1, while the updating equation for r is unchanged, in light of the fact 
that r*+^ is computed from scratch at each iteration. Our choice of step* is based on 
a simple step halving criterion, which has proven satisfactory for our needs, but more 
sophisticated criteria are available. At convergence we obtain 7 = log{Zexp(w)} 
with asymptotic covariance matrix 

cov(7) = J^MJ, 

where J = diagexp((i;) Z^[diag{Zexp((i;)}]^-'^M is the Jacobian of the map 1— )■ 7. 

Finally, concerning the choice of the initial estimate u^, we start from the maxi- 
mum likelihood estimate under the saturated model: this choice is believed to result 
in quick convergence, because it makes the algorithm start close to the data, and our 
experience confirms this belief. If zero cell counts are present, we smooth the data by 
means of a convex combination with the mutual independence table having the same 
sample size and univariate counts as the data, using weights 0.5 for this "prior" table 
and for the actually observed table. 
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